Prediction of OCR accuracy using simple image features

Authors

  • L. R. Blando
  • Junichi Kanai
  • Thomas A. Nartker
Abstract

A classifier for predicting the character accuracy achieved by any Optical Character Recognition (OCR) system on a given page is presented. This classifier is based on measuring the amount of white speckle, the amount of character fragments, and overall size information in the page. No output from the OCR system is used. The given page is classified as either “good” quality (i.e., high OCR accuracy expected) or “poor” (i.e., low OCR accuracy expected). Results of processing 639 pages show a recognition rate of approximately 85%. This performance compares favorably with the ideal-case performance of a prediction method based upon the number of reject-markers in OCR generated text.
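The abstract names the features (white speckle, character fragments, size information) but does not define them, so the following is only an illustrative sketch and not the authors' method. It assumes a binarized page given as a list of rows with 1 for black and 0 for white, treats very small white connected components as "white speckle" and very small black ones as "character fragments", and normalizes both by the number of black components. The function names, the pixel-size cutoffs, and the decision thresholds are all hypothetical, and the size-information feature is left out.

```python
from collections import deque


def component_sizes(grid, value):
    """Pixel counts of the 4-connected components whose pixels equal `value`."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes


def page_features(grid, speckle_max=2, fragment_max=2):
    """Hypothetical feature definitions; the cutoffs are assumptions."""
    black = component_sizes(grid, 1)
    white = component_sizes(grid, 0)
    n_black = max(len(black), 1)  # avoid division by zero on blank pages
    return {
        # Tiny white islands, e.g. holes inside thickened strokes.
        "white_speckle": sum(s <= speckle_max for s in white) / n_black,
        # Tiny black pieces, e.g. broken characters.
        "fragments": sum(s <= fragment_max for s in black) / n_black,
    }


def classify(features, speckle_limit=0.1, fragment_limit=0.3):
    """Rule-based good/poor decision; both limits are made-up placeholders."""
    if (features["white_speckle"] > speckle_limit
            or features["fragments"] > fragment_limit):
        return "poor"
    return "good"
```

Calling `classify(page_features(page))` on a binarized page returns "good" or "poor" without consulting any OCR output, which mirrors the setting the abstract describes. In practice the thresholds would have to be tuned on pages with known OCR accuracy.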

Similar resources

Prediction of OCR Accuracy

The accuracy of all contemporary OCR technologies varies drastically as a function of input image quality [Rice 92, Rice 93, Chen 93, Rice 94]. Given high quality images, many devices consistently deliver output text in excess of 99% correct. For low quality images, even images which are easily read by a human, output accuracy is frequently below 90%. This extreme sensitivity to quality is well...

Prediction of OCR accuracy using a Neural Network

A method for predicting the accuracy achieved by an OCR system on an input image is presented. It is assumed that there is an ideal prediction function. A neural network is trained to estimate the unknown ideal function. In this project, multilayer perceptrons were trained to predict the character accuracy performance of two OCR systems using the backpropagation training method. The results sho...

Development of an Ensemble Multi-stage Machine for Prediction of Breast Cancer Survivability

Prediction of cancer survivability using machine learning techniques has become a popular approach in recent years. In this regard, an important issue is that preparation of some features may need conducting difficult and costly experiments while these features have less significant impacts on the final decision and can be ignored from the feature set. Therefore, developing a machine for p...

Content-Oriented Categorization of Document Images

We have developed a technique that categorizes document images based on their content. Unlike conventional methods that use optical character recognition (OCR), we convert document images into word shape tokens, a shape-based representation of words. Because we have only to recognize simple graphical features from image, this process is much faster than OCR. Although the mapping between word sh...

OCR for Handwritten Kannada Language Script

The optical character recognition (OCR) is the process of converting textual scanned image into a computer editable format. The proposed OCR system is for complex handwritten Kannada characters. One of the major challenges faced by Kannada OCR system is recognition of handwritten text from an image. The input text image is subjected to preprocessing and then converted into binary image. Segment...

Publication date: 1995